DisC Diversity: Result Diversification based on 
Dissimilarity and Coverage
ABSTRACT

Recently, result diversification has attracted a lot of atten- 
tion as a means to improve the quality of results retrieved by 
user queries. In this paper, we propose a new, intuitive defi- 
nition of diversity called DisC diversity. A DisC diverse sub- 
set of a query result contains objects such that each object 
in the result is represented by a similar object in the diverse 
subset and the objects in the diverse subset are dissimilar 
to each other. We show that locating a minimum DisC di- 
verse subset is an NP-hard problem and provide heuristics 
for its approximation. We also propose adapting DisC di- 
verse subsets to a different degree of diversification. We call 
this operation zooming. We present efficient implementa- 
tions of our algorithms based on the M-tree, a spatial index 
structure, and experimentally evaluate their performance. 

1. INTRODUCTION 

Result diversification has attracted considerable attention 
as a means of enhancing the quality of the query results presented
to users (e.g., [25, 31]). Consider, for example, a user
who wants to buy a camera and submits a related query. A 
diverse result, i.e., a result containing various brands and 
models with different pixel counts and other technical char- 
acteristics is intuitively more informative than a homoge- 
neous result containing only cameras with similar features. 

There have been various definitions of diversity [10], based 
on (i) content (or similarity), i.e., objects that are dissim- 
ilar to each other (e.g., [31]), (ii) novelty, i.e., objects that 
contain new information when compared to what was pre- 
viously presented (e.g., [9]) and (iii) semantic coverage, i.e., 
objects that belong to different categories or topics (e.g., 
[3]). Most previous approaches rely on assigning a diversity 
score to each object and then selecting either the k objects 
with the highest score for a given k (e.g., [4, 14]) or the
objects with score larger than some threshold (e.g., [28]).

In this paper, we address diversity through a different
perspective. Let V be the set of objects in a query result. 
We consider two objects $p_1$ and $p_2$ in V to be similar, if
$dist(p_1, p_2) \le r$ for some distance function dist and real
number r, where r is a tuning parameter that we call radius.
Given V, we select a representative subset $S \subseteq V$ to
be presented to the user such that: (i) all objects in V are 
similar with at least one object in S and (ii) no two objects 
in S are similar with each other. The first condition en- 
sures that all objects in V are represented, or covered, by at 
least one object in the selected subset. The second condi- 
tion ensures that the selected objects in V are dissimilar, or 
independent. We call the set S an r-Dissimilar and Covering
subset or r-DisC diverse subset.

In contrast to previous approaches to diversification, we
aim at computing subsets of objects that contain objects 
that are both dissimilar with each other and cover the whole 
result set. Furthermore, instead of specifying a required 
size k of the diverse set or a threshold, our tuning pa- 
rameter r explicitly expresses the degree of diversification 
and determines the size of the diverse set. Increasing r re- 
sults in smaller, more diverse subsets, while decreasing r 
results in larger, less diverse subsets. We call these operations
zooming-out and zooming-in, respectively. One can
also zoom-in or zoom-out locally to a specific object in the 
presented result. 

As an example, consider searching for cities in Greece.
Figure 1 shows the results of this query diversified based on
geographical location for an initial radius (a), after zooming-in
(b), after zooming-out (c) and after local zooming-in on a specific
city (d). As another example of local zooming in the case of
categorical attributes, consider looking for cameras, where
diversity refers to cameras with different features. Figure 2
depicts an initial most diverse result and the result of local
zooming-in on one individual camera in this result.

We formalize the problem of locating minimum DisC di- 
verse subsets as an independent dominating set problem on 
graphs [17]. We provide a suite of heuristics for computing
small DisC diverse subsets. We also consider the problem of 
adjusting the radius r. We explore the relation among DisC 
diverse subsets of different radii and provide algorithms for 
incrementally adapting a DisC diverse subset to a new ra- 
dius. We provide theoretical upper bounds for the size of 
the diverse subsets produced by our algorithms for com- 
puting DisC diverse subsets as well as for their zooming 
counterparts. Since the crux of the efficiency of the pro- 
posed algorithms is locating neighbors, we take advantage 
of spatial data structures. In particular, we propose efficient
algorithms based on the M-tree [29].

Figure 1: Zooming operations in action: (a) initial set, (b)
zooming-in, (c) zooming-out and (d) local zooming-in. Selected
objects are shown in bold. Solid circles denote the
radius r of the selected objects.

We compare the quality of our approach to other diver- 
sification methods both analytically and qualitatively. We 
also evaluate our various heuristics using both real and syn- 
thetic datasets. Our performance results show that the ba- 
sic heuristic for computing dissimilar and covering subsets 
works faster than its greedy variation but produces larger 
sets. Relaxing the dissimilarity condition, although in the- 
ory could result in smaller sets, in our experiments does not 
reduce the size of the result considerably. Our incremental 
algorithms for zooming in or out to a different radius r', 
when compared to computing a DisC diverse subset for r' 
from scratch, produce sets of similar sizes and closer to what 
the user intuitively expects, while imposing a smaller com- 
putational cost. Finally, we draw various conclusions for the 
M-tree implementation of these algorithms. 

Most often diversification is modeled as a bi-criteria prob- 
lem with the dual goal of maximizing both the diversity and 
the relevance of the selected results. In this paper, we focus 
solely on diversity. Since we "cover" the whole dataset, each 
user may "zoom-in" to the area of the results that seems 
most relevant to her individual needs. Of course, many other 
approaches to integrating relevance with DisC diversity are 
possible; we discuss some of them in Section 8.

In a nutshell, in this paper, we make the following contri- 
butions: 

- we propose a new, intuitive definition of diversity, called 
DisC diversity and compare it with other models, 

- we show that locating minimum DisC diverse subsets 
is an NP-hard problem and provide efficient heuristics 
along with approximation bounds, 

- we introduce adaptive diversification through zooming- 
in and zooming-out and present algorithms for their 
incremental computation as well as corresponding the- 
oretical bounds, 

- we provide M-tree tailored algorithms and experimen- 
tally evaluate their performance. 

Figure 2: Zooming in a specific camera.

The rest of the paper is structured as follows. Section 2
introduces DisC diversity and heuristics for computing small
diverse subsets, Section 3 introduces adaptive diversification
and Section 4 compares our approach with other diversification
methods. In Section 5, we employ the M-tree for
the efficient implementation of our algorithms, while in Section 6
we present experimental results. Finally, Section 7
presents related work and Section 8 concludes the paper.

2. DISC DIVERSITY 

In this section, we first provide a formal definition of DisC 
diversity. We then show that locating a minimum DisC 
diverse set of objects is an NP-hard problem and present 
heuristics for locating approximate solutions. 

2.1 Definition of DisC Diversity 

Let V be a set of objects returned as the result of a user 
query. We want to select a representative subset S of these 
objects such that each object from V is represented by a 
similar object in S and the objects selected to be included 
in S are dissimilar to each other. 

We define similarity between two objects using a distance
metric dist. For a real number r, r > 0, we use $N_r(p_i)$ to
denote the set of neighbors (or neighborhood) of an object
$p_i \in V$, i.e., the objects lying at distance at most r from $p_i$:

$N_r(p_i) = \{p_j \mid p_i \ne p_j \wedge dist(p_i, p_j) \le r\}$

We use $N_r^+(p_i)$ to denote the set $N_r(p_i) \cup \{p_i\}$, i.e., the
neighborhood of $p_i$ including $p_i$ itself. Objects in the neighborhood
of $p_i$ are considered similar to $p_i$, while objects
outside its neighborhood are considered dissimilar to $p_i$. We
define an r-DisC diverse subset as follows:

Definition 1. (r-DisC Diverse Subset) Let V be a
set of objects and r, r > 0, a real number. A subset $S \subseteq V$
is an r-Dissimilar-and-Covering diverse subset, or r-DisC
diverse subset, of V, if the following two conditions hold:
(i) (coverage condition) $\forall p_i \in V$, $\exists p_j \in N_r^+(p_i)$, such that
$p_j \in S$ and (ii) (dissimilarity condition) $\forall p_i, p_j \in S$ with
$i \ne j$, it holds that $dist(p_i, p_j) > r$.

The first condition ensures that all objects in V are rep- 
resented by at least one similar object in S and the second 
condition that the objects in S are dissimilar to each other. 
We call every object $p_i \in S$ an r-DisC diverse object and r
the radius of S. When the value of r is clear from context, 
we simply refer to r-DisC diverse objects as diverse objects. 
Given V, we would like to select the smallest number of 
diverse objects. 
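
The two conditions of Definition 1 translate directly into a membership check. The following is a minimal sketch, assuming objects are coordinate tuples compared with the Euclidean distance (`math.dist`); the function name `is_disc_diverse` and the sample points are illustrative and do not come from the paper.

```python
import math

def is_disc_diverse(objects, subset, r, dist=math.dist):
    """Return True if `subset` is an r-DisC diverse subset of `objects`."""
    # The subset must consist of objects from the result set.
    if any(s not in objects for s in subset):
        return False
    # Coverage: every object lies within distance r of some selected object
    # (a selected object covers itself at distance 0).
    covered = all(any(dist(p, s) <= r for s in subset) for p in objects)
    # Dissimilarity: selected objects are pairwise farther apart than r.
    dissimilar = all(dist(a, b) > r
                     for i, a in enumerate(subset)
                     for b in subset[i + 1:])
    return covered and dissimilar

# Hypothetical sample data: four collinear points, radius 1.
points = [(0, 0), (1, 0), (2, 0), (3, 0)]
valid = is_disc_diverse(points, [(0, 0), (2, 0)], r=1)
too_close = is_disc_diverse(points, [(0, 0), (1, 0), (3, 0)], r=1)
uncovered = is_disc_diverse(points, [(0, 0)], r=1)
```

In this toy input, the first subset satisfies both conditions, the second violates dissimilarity (two selected objects lie at distance exactly r) and the third violates coverage.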




Figure 3: (a) Minimum r-DisC diverse subsets
for the depicted objects: $\{p_1, p_4, p_7\}$, $\{p_2, p_4, p_7\}$,
$\{p_3, p_5, p_6\}$, $\{p_3, p_5, p_7\}$ and (b) their graph representation.

Definition 2. (The Minimum r-DisC Diverse Subset
Problem) Given a set V of objects and a radius r, find an
r-DisC diverse subset $S^*$ of V, such that, for every r-DisC
diverse subset S of V, it holds that $|S^*| \le |S|$.

In general, there may be more than one minimum r-DisC
diverse subset of V (see Figure 3(a) for an example).

2.2 Graph Representation and NP-hardness 

Let us consider the following graph representation of a set
V of objects. Let $G_{V,r} = (V, E)$ be an undirected graph
such that there is a vertex $v_i$ for each object $p_i \in V$
and an edge $(v_i, v_j) \in E$, if and only if, $dist(p_i, p_j) \le r$ for
the corresponding objects $p_i$, $p_j$. An example is shown in
Figure 3(b).
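
The graph representation can be materialized with a quadratic scan over all pairs of objects. The sketch below assumes Euclidean distance over coordinate tuples and stores the graph as adjacency sets keyed by object index; `build_graph` and the sample points are hypothetical names and data chosen for illustration.

```python
import math
from itertools import combinations

def build_graph(objects, r, dist=math.dist):
    """Adjacency sets of the graph: i and j are joined iff dist(p_i, p_j) <= r."""
    adj = {i: set() for i in range(len(objects))}
    for i, j in combinations(range(len(objects)), 2):
        if dist(objects[i], objects[j]) <= r:
            adj[i].add(j)
            adj[j].add(i)
    return adj

# Hypothetical sample data: three nearby points and one isolated point.
points = [(0, 0), (1, 0), (2, 0), (5, 5)]
graph = build_graph(points, r=1)
```

An isolated vertex, such as the last point here, must belong to every dominating set and hence to every r-DisC diverse subset.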

Let us recall a couple of graph-related definitions. A dom- 
inating set D for a graph G is a subset of vertices of G such 
that every vertex of G not in D is joined to at least one 
vertex of D by some edge. An independent set I for a graph 
G is a set of vertices of G such that for every two vertices in 
I, there is no edge connecting them. It is easy to see that
a dominating set of $G_{V,r}$ satisfies the covering conditions of
Definition 1, whereas an independent set of $G_{V,r}$ satisfies
the dissimilarity conditions of Definition 1. Thus:

Observation 1. Solving the Minimum r-DisC Diverse 
Subset Problem for a set V is equivalent to finding a Min- 
imum Independent Dominating Set of the corresponding 
graph $G_{V,r}$.

The Minimum Independent Dominating Set Problem 
has been proven to be NP-hard [15]. The problem remains 
NP-hard even for special kinds of graphs, such as for unit 
disk graphs [8]. Unit disk graphs are graphs whose vertices
can be put in one to one correspondence with equisized cir- 
cles in a plane such that two vertices are joined by an edge, 
if and only if, the corresponding circles intersect. $G_{V,r}$ is a
unit disk graph for Euclidean distances. 

In the following, we use the terms dominance and coverage,
as well as independence and dissimilarity, interchangeably.
In particular, two objects $p_i$ and $p_j$ are independent, if
$dist(p_i, p_j) > r$. We also say that an object covers all objects
in its neighborhood. We next present some useful properties 
that relate the coverage (i.e., dominance) and dissimilarity 
(i.e., independence) conditions. A maximal independent set 
is an independent set such that adding any other vertex to 
the set forces the set to contain an edge, that is, it is an in- 
dependent set that is not a subset of any other independent 
set. It is known that: 

Lemma 1. An independent set of a graph is maximal, if 
and only if, it is dominating. 




Figure 4: (a) Minimum dominating set ($\{v_2, v_5\}$)
and (b) a minimum independent dominating set
($\{v_2, v_4, v_6\}$) for the depicted graph.

From Lemma 1, we conclude that:

Observation 2. A minimum maximal independent set is 
also a minimum independent dominating set. 

However, 

Observation 3. A minimum dominating set is not nec- 
essarily independent. 

For example, in Figure 4, the minimum dominating set of
the depicted objects is of size 2, while the minimum inde- 
pendent dominating set is of size 3. 
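
The gap between the two notions can be checked exhaustively on a small graph. The sketch below uses a six-vertex graph whose edges are inferred from the textual description of Figure 4 (an assumption, since the figure itself is not reproduced here): $v_2$ is adjacent to $v_1$, $v_3$ and $v_5$, and $v_5$ is adjacent to $v_4$ and $v_6$. The brute-force search is for illustration only and is exponential in the number of vertices.

```python
from itertools import combinations

# Edges inferred from the description of Figure 4 (an assumption).
EDGES = [(1, 2), (2, 3), (2, 5), (4, 5), (5, 6)]
VERTICES = range(1, 7)
ADJ = {v: set() for v in VERTICES}
for a, b in EDGES:
    ADJ[a].add(b)
    ADJ[b].add(a)

def is_dominating(subset):
    """Every vertex is in the subset or adjacent to a vertex of the subset."""
    chosen = set(subset)
    return all(v in chosen or ADJ[v] & chosen for v in VERTICES)

def is_independent(subset):
    """No two vertices of the subset are joined by an edge."""
    return all(b not in ADJ[a] for a, b in combinations(subset, 2))

def smallest(predicate):
    """Exhaustively return a smallest vertex subset satisfying `predicate`."""
    for size in range(1, len(VERTICES) + 1):
        for subset in combinations(VERTICES, size):
            if predicate(subset):
                return set(subset)

min_dominating = smallest(is_dominating)
min_independent_dominating = smallest(
    lambda s: is_dominating(s) and is_independent(s))
```

On this graph the smallest dominating set has two vertices, which are adjacent, while every independent dominating set needs at least three.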

2.3 Computing DisC Diverse Objects 

We consider first a baseline algorithm for computing a 
DisC diverse subset S of V. For presentation convenience,
let us call black the objects of V that are in S, grey the 
objects covered by S and white the objects that are neither 
black nor grey. Initially, S is empty and all objects are white. 
The algorithm proceeds as follows: until there are no more 
white objects, it selects an arbitrary white object $p_i$, colors
$p_i$ black and colors all objects in $N_r(p_i)$ grey. We call this
algorithm Basic-DisC. 
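
The coloring procedure can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance over coordinate tuples, a linear scan in place of an index, and input order as the "arbitrary" selection order; the name `basic_disc` and the sample points are hypothetical.

```python
import math

def basic_disc(objects, r, dist=math.dist):
    """Basic-DisC sketch: pick any white object, then grey out its neighbors."""
    color = ["white"] * len(objects)
    selected = []
    for i, p in enumerate(objects):      # "arbitrary" order = input order here
        if color[i] != "white":
            continue
        color[i] = "black"               # p enters the diverse subset
        selected.append(p)
        for j, q in enumerate(objects):
            if color[j] == "white" and dist(p, q) <= r:
                color[j] = "grey"        # q is now covered by p
    return selected

# Hypothetical sample data.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
diverse = basic_disc(points, r=1)
```

A different visiting order may yield a different, and possibly smaller, diverse subset, which is what motivates the greedy variation.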

The produced set S is clearly an independent set, since 
once an object enters S, all its neighbors become grey and 
thus are withdrawn from consideration. It is also a maxi- 
mal independent set, since at the end there are only grey 
objects left, thus adding any of them to S would violate the 
independence of S. From Lemma 1, the set S produced by
Basic-DisC is an r-DisC diverse subset. S is not necessarily 
a minimum r-DisC diverse subset. However, its size is re- 
lated to the size of any minimum r-DisC diverse subset S* 
as follows: 

Theorem 1. Let B be the maximum number of independent
neighbors of any object in V. Any r-DisC diverse subset
S of V is at most B times larger than any minimum r-DisC
diverse subset $S^*$.

Proof. Since S is an independent set, any object in $S^*$
can cover at most B objects in S and thus $|S| \le B|S^*|$. □

The value of B depends on the distance metric used and 
also on the dimensionality d of the data space. For many 
distance metrics B is a constant. Next, we show how B 
is bounded for specific combinations of the distance metric 
and data dimensionality d. 

Lemma 2. If dist is the Euclidean distance and d = 2,
each object $p_i$ in V has at most 5 neighbors that are independent
from each other.

Proof. Let $p_1$, $p_2$ be two independent neighbors of $p_i$.
Then, it must hold that $\angle p_1 p_i p_2$ is larger than $\frac{\pi}{3}$. Otherwise,
$dist(p_1, p_2) \le \max\{dist(p_i, p_1), dist(p_i, p_2)\} \le r$, which
contradicts the independence of $p_1$ and $p_2$. Therefore, $p_i$ can
have at most $(2\pi / \frac{\pi}{3}) - 1 = 5$ independent neighbors. □
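
The bound of Lemma 2 can be illustrated numerically: five points placed as a regular pentagon at distance r from the center are pairwise more than r apart, whereas the six points of a regular hexagon are exactly r apart and therefore not independent. The sketch below only checks these two configurations and is not a substitute for the proof; the helper names and the small tolerance `1e-9` are assumptions made to guard against floating-point rounding.

```python
import math
from itertools import combinations

def regular_polygon(n, radius):
    """n points evenly spaced on a circle of the given radius around the origin."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def independent_neighbors(points, r, eps=1e-9):
    """True if all points are neighbors of the origin and pairwise independent."""
    near_center = all(math.dist((0.0, 0.0), p) <= r + eps for p in points)
    far_apart = all(math.dist(a, b) > r + eps
                    for a, b in combinations(points, 2))
    return near_center and far_apart

r = 1.0
five_ok = independent_neighbors(regular_polygon(5, r), r)
six_ok = independent_neighbors(regular_polygon(6, r), r)
```

The pentagon side is $2r \sin(\pi/5) \approx 1.18\,r$, so five independent neighbors are attainable, while the hexagon side equals r.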



Algorithm 1 Greedy-DisC

Input: A set of objects V and a radius r.
Output: An r-DisC diverse subset S of V.

1: $S \leftarrow \emptyset$
2: for all $p_i \in V$ do
3:     Color $p_i$ white
4: end for
5: while there exist white objects do
6:     Select the white object $p_i$ with the largest $|N_r^W(p_i)|$
7:     $S = S \cup \{p_i\}$
8:     Color $p_i$ black
9:     for all $p_j \in N_r^W(p_i)$ do
10:        Color $p_j$ grey
11:    end for
12: end while
13: return S



Lemma 3. If dist is the Manhattan distance and d = 2,
each object $p_i$ in V has at most 7 neighbors that are independent
from each other.

Proof. The proof can be found in the Appendix. □ 

For d = 3 and the Euclidean distance, it can be shown, using
packing techniques and properties of solid angles, that
each object $p_i$ in V has at most 24 neighbors that are independent
from each other.

We now consider the following intuitive greedy variation 
of Basic-DisC, that we call Greedy-DisC. Instead of select- 
ing white objects arbitrarily at each step, we select the white 
object with the largest number of white neighbors, that is, 
the white object that covers the largest number of uncovered
objects. Greedy-DisC is shown in Algorithm 1, where
$N_r^W(p_i)$ is the set of the white neighbors of object $p_i$.
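
Algorithm 1 can be sketched as follows. This is an illustrative brute-force version, assuming Euclidean distance over coordinate tuples and precomputed neighbor sets instead of an index; ties are broken by the lowest object index, which is a choice made here and not specified by the algorithm.

```python
import math

def greedy_disc(objects, r, dist=math.dist):
    """Greedy-DisC sketch with brute-force neighbor computation."""
    n = len(objects)
    nbrs = [{j for j in range(n)
             if j != i and dist(objects[i], objects[j]) <= r}
            for i in range(n)]
    white = set(range(n))
    selected = []
    while white:
        # White object with the most white neighbors; ties go to the lowest index.
        best = max(sorted(white), key=lambda i: len(nbrs[i] & white))
        selected.append(objects[best])
        white.discard(best)       # best becomes black
        white -= nbrs[best]       # its white neighbors become grey
    return selected

# Hypothetical sample data: five collinear points, radius 1.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
diverse = greedy_disc(points, r=1)
```

On these five points the greedy choice returns two objects, whereas selecting white objects in input order would return three, namely (0, 0), (2, 0) and (4, 0).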

While the size of the r-DisC diverse subset S produced
by Greedy-DisC is expected to be smaller than that of the
subset produced by Basic-DisC, the fact that we consider
for inclusion in S only white, i.e., independent, objects may
still not reduce the size of S as much as expected. From
Observation 3, it is possible that an independent covering
set is larger than a covering set that also includes dependent
objects. For example, consider the nodes (or equivalently
the corresponding objects) in Figure 4. Assume that object
$v_2$ is inserted in S first, resulting in objects $v_1$, $v_3$ and $v_5$
becoming grey. Then, we need two more objects, namely, $v_4$
and $v_6$, for covering all objects. However, if we consider for
inclusion grey objects as well, then $v_5$ can join S, resulting
in a smaller covering set.

Motivated by this observation, we also define r-C diverse
subsets that satisfy only the coverage condition of Definition 1
and modify Greedy-DisC accordingly to compute r-C
diverse sets. The only change required is that in line 6 of
Algorithm 1, we select among both white and grey objects. This
allows us to select at each step the object that covers the 
largest possible number of uncovered objects, even if this 
object is grey. We call this variation Greedy-C. In the case 
of Greedy-C, we prove the following bound for the size of 
the produced r-C diverse subset S: 

Theorem 2. Let $\Delta$ be the maximum number of neighbors
of any object in V. The r-C diverse subset produced
by Greedy-C is at most $\ln \Delta$ times larger than the minimum
r-DisC diverse subset $S^*$.

Proof. The proof can be found in the Appendix. □ 
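
Greedy-C can be sketched along the same lines as Greedy-DisC. The version below is an assumption-laden illustration: it scores each candidate by the number of uncovered objects in its closed neighborhood (the object itself included), which guarantees progress when white and grey candidates compete, and it breaks ties by the lowest index. The six sample points are hypothetical coordinates chosen so that, for r = 5.5, their neighborhoods match the adjacency described for Figure 4.

```python
import math

def greedy_c(objects, r, dist=math.dist):
    """Greedy-C sketch: grey (already covered) objects may also be selected."""
    n = len(objects)
    # Closed neighborhoods, each including the object itself.
    closed = [{j for j in range(n) if dist(objects[i], objects[j]) <= r}
              for i in range(n)]
    uncovered = set(range(n))             # the white objects
    selected = []
    while uncovered:
        # Candidate, white or grey, that covers the most uncovered objects.
        best = max(range(n), key=lambda i: len(closed[i] & uncovered))
        selected.append(objects[best])
        uncovered -= closed[best]
    return selected

# Hypothetical coordinates for v1..v6; neighbors are 5 apart, others at least 8.
points = [(-3, 4), (0, 0), (-3, -4), (8, 4), (5, 0), (8, -4)]
covering = greedy_c(points, r=5.5)
```

Here two objects suffice for coverage, but they are within distance r of each other, so the result is r-C diverse without being r-DisC diverse.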




Figure 5: Zooming (a) in and (b) out. Solid and
dashed circles correspond to radius r' and r respectively.

3. ADAPTIVE DIVERSIFICATION 

The radius r determines the desired degree of diversification.
A large radius yields fewer representative objects that
are less similar to each other, whereas a small radius yields
more representative objects that are less dissimilar to each
other. On one extreme, a radius equal to the largest distance
between any two objects results in a single object being selected
and on the other extreme, a zero radius results in all objects of
V being selected. We consider an interactive mode of operation
where, after being presented with an initial set of
results for some r, a user can see either more or fewer results
by correspondingly decreasing or increasing r.

Specifically, given a set of objects V and an r-DisC diverse
subset $S^r$ of V, we want to compute an r'-DisC diverse
subset $S^{r'}$ of V. There are two cases: (i) r' < r and (ii) r' > r,
which we call zooming-in and zooming-out respectively.
These operations are global in the sense that the radius r is
modified similarly for all objects in V. We may also modify
the radius for a specific area of the data set. Consider, for
example, a user that receives an r-DisC diverse subset $S^r$
of the results and finds some specific object $p_i \in S^r$ more
or less interesting. Then, the user can zoom-in or zoom-out
by specifying a radius r', r' < r or r' > r, respectively,
centered in $p_i$. We call these operations local zooming-in and
local zooming-out respectively.

To study the size relationship between $S^r$ and $S^{r'}$, we
define the set $N^I_{r_1,r_2}(p_i)$, $r_2 \ge r_1$, as the set of objects at
distance at most $r_2$ from $p_i$ which are at distance at least $r_1$
from each other, i.e., objects in $N_{r_2}(p_i)$ that are independent
from each other considering the radius $r_1$. The following
lemma bounds the size of $N^I_{r_1,r_2}(p_i)$ for specific distance
metrics and dimensionality.

Lemma 4. Let $r_1$, $r_2$ be two radii with $r_2 \ge r_1$. Then, for
d = 2:

(i) if dist is the Euclidean distance:

$|N^I_{r_1,r_2}(p_i)| \le 9 \lceil \log_\beta (r_2/r_1) \rceil$, where $\beta = \frac{1+\sqrt{5}}{2}$

(ii) if dist is the Manhattan distance:

$|N^I_{r_1,r_2}(p_i)| \le 4 \sum_{i=1}^{\gamma} (2i+1)$, where $\gamma =$ [...]

Proof. The proof can be found in the Appendix. □ 

Since we want to support an incremental mode of operation,
the set $S^{r'}$ should be as close as possible to the already
seen result $S^r$. Ideally, $S^{r'} \supseteq S^r$ for r' < r and $S^{r'} \subseteq S^r$
for r' > r. We would also like the size of $S^{r'}$ to be as close as
possible to the size of the minimum r'-DisC diverse subset.



If we consider only the coverage condition, that is, only
r-C diverse subsets, then an r-C diverse subset of V is also
an r'-C diverse subset of V, for any r' > r. This holds
because $N_r(p_i) \subseteq N_{r'}(p_i)$ for any r' > r. However, a similar
property does not hold for the dissimilarity condition. In
particular, a maximal independent diverse subset $S^r$ of V
for r is not necessarily a maximal independent diverse subset
of V, neither for r' > r nor for r' < r. To see this, note that
for r' > r, $S^r$ may not be independent, whereas for r' < r,
$S^r$ may no longer be maximal. Thus, from Lemma 1, we
reach the following conclusion:

Observation 4. In general, there is no monotonic property
among the r-DisC diverse and the r'-DisC diverse subsets
of a set of objects V, for $r \ne r'$.

For zooming-in, i.e., for r' < r, we can construct r'-DisC
diverse sets that are supersets of $S^r$ by adding objects to
$S^r$ to make it maximal. For zooming-out, i.e., for r' > r,
in general, there may be no subset of $S^r$ that is r'-DisC
diverse. Take for example the objects of Figure 5(b) with
$S^r = \{p_1, p_2, p_3\}$. No subset of $S^r$ is an r'-DisC diverse subset
for this set of objects. Next, we detail the zooming-in and
zooming-out operations.

3.1 Incremental Zooming-in 

Let us first consider the case of zooming with a smaller
radius, i.e., r' < r. Here, we aim at producing a small
independent covering solution $S^{r'}$, such that $S^{r'} \supseteq S^r$. For
this reason, we keep the objects of $S^r$ in the new r'-DisC
diverse subset $S^{r'}$ and proceed as follows.

Consider an object of $S^r$, for example $p_1$ in Figure 5(a).
Objects at distance at most r' from $p_1$ are still covered by
$p_1$ and cannot enter $S^{r'}$. Objects at distance greater than r'
and at most r may be uncovered and join $S^{r'}$. Each of these
objects can enter $S^{r'}$ as long as it is not covered by some
other object of $S^{r'}$ that lies outside the former neighborhood
of $p_1$. For example, in Figure 5(a), $p_4$ and $p_5$ may enter $S^{r'}$
while $p_3$ cannot, since, even with the smaller radius r', $p_3$
is covered by $p_2$.

To produce an r'-DisC diverse subset based on an r-DisC
diverse subset, we consider such objects in turn. The order
can be either arbitrary (Zoom-In algorithm) or greedy,
where at each step the object that covers the
largest number of uncovered objects is selected (Greedy-Zoom-In,
Algorithm 2).

Lemma 5. For the set $S^{r'}$ generated by the Zoom-In and
Greedy-Zoom-In algorithms, it holds that:

(i) $S^r \subseteq S^{r'}$ and

(ii) $|S^{r'}| \le |N^I_{r',r}(p_i)| \, |S^r|$

Proof. Condition (i) trivially holds from step 1 of the
algorithm. Condition (ii) holds since for each object in $S^r$
there are at most $|N^I_{r',r}(p_i)|$ independent objects at distance
greater than r' from each other that can enter $S^{r'}$. □

In practice, objects selected to enter $S^{r'}$, such as $p_4$ and $p_5$
in Figure 5(a), are likely to cover other objects left uncovered
by the same or similar objects in $S^r$. Therefore, the size
difference between $S^{r'}$ and $S^r$ is expected to be smaller than
this theoretical upper bound.



Algorithm 2 Greedy-Zoom-In

Input: A set of objects V, an initial radius r, a solution $S^r$ and
a new radius r' < r.
Output: An r'-DisC diverse subset of V.

1: $S^{r'} \leftarrow S^r$
2: for all $p_i \in S^r$ do
3:     Color objects in $\{N_r(p_i) \setminus N_{r'}(p_i)\}$ white
4: end for
5: while there exist white objects do
6:     Select the white object $p_i$ with the largest $|N_{r'}^W(p_i)|$
7:     Color $p_i$ black
8:     $S^{r'} = S^{r'} \cup \{p_i\}$
9:     for all $p_j \in N_{r'}^W(p_i)$ do
10:        Color $p_j$ grey
11:    end for
12: end while
13: return $S^{r'}$
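
Greedy zooming-in can be sketched as follows. The sketch assumes Euclidean distance over coordinate tuples and treats as white exactly those objects that no previously selected object covers at the new radius, following the textual description of which objects may enter the new solution; the name `greedy_zoom_in`, the tie-breaking by lowest index and the sample data are illustrative assumptions.

```python
import math

def greedy_zoom_in(objects, prev_solution, new_r, dist=math.dist):
    """Extend an r-DisC diverse subset to an r'-DisC diverse one, r' < r."""
    n = len(objects)
    nbrs = [{j for j in range(n)
             if j != i and dist(objects[i], objects[j]) <= new_r}
            for i in range(n)]
    solution = list(prev_solution)        # keep every previously shown object
    # White = objects that no kept object covers at the smaller radius r'.
    white = {i for i in range(n)
             if all(dist(objects[i], s) > new_r for s in solution)}
    while white:
        best = max(sorted(white), key=lambda i: len(nbrs[i] & white))
        solution.append(objects[best])    # best becomes black
        white.discard(best)
        white -= nbrs[best]               # its white neighbors become grey
    return solution

# Hypothetical sample data: [(2, 0)] is a 2-DisC diverse subset of `points`.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
zoomed = greedy_zoom_in(points, [(2, 0)], new_r=1)
```

The returned list starts with the previous solution, so the result is a superset of what the user has already seen.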



3.2 Incremental Zooming-out 

Next, we consider zooming with a larger radius, i.e., r' > r.
In this case, the user is interested in seeing fewer and
more dissimilar objects, ideally a subset of the already seen
results for r, that is, $S^{r'} \subseteq S^r$. However, as
discussed, in contrast to zooming-in, it may not be possible
to construct a diverse subset $S^{r'}$ that is a subset of $S^r$.

Thus, we focus on the following sets of objects: (i) $S^r \setminus S^{r'}$
and (ii) $S^{r'} \setminus S^r$. The first set consists of the objects that
belong to the previous diverse subset but are removed from
the new one, while the second set consists of the new objects
added to $S^{r'}$. To illustrate, let us consider for example the
objects of Figure 5(b) and that $p_1, p_2, p_3 \in S^r$. Since the
radius becomes larger, $p_1$ now covers all objects at distance
at most r' from it. This may include a number of objects
that also belonged to $S^r$, such as $p_2$. These objects have
to be removed from the solution, since they are no longer
dissimilar to $p_1$. However, removing such an object, say $p_2$
in our example, can potentially leave uncovered a number of
objects that were previously covered by $p_2$ (these objects lie
in the shaded area of Figure 5(b)). In our example, requiring
$p_1$ to remain in $S^{r'}$ means that $p_5$ should now be added to
$S^{r'}$.

To produce an r'-DisC diverse subset based on an r-DisC
diverse subset, we proceed in two passes. In the first pass,
we examine all objects of $S^r$ in some order and remove their
diverse neighbors that are now covered by them. In the second
pass, objects from any uncovered areas are added to $S^{r'}$.
Again, we have an arbitrary and a greedy variation, denoted
Zoom-Out and Greedy-Zoom-Out respectively. Algorithm 3
shows the greedy variation; the first pass (lines 4-11) considers
$S^r \setminus S^{r'}$, while the second pass (lines 12-19) considers
$S^{r'} \setminus S^r$. Initially, we color all previously black objects red.
All other objects are colored white. We consider three variations
for the first pass of the greedy algorithm: selecting
the red objects with (a) the largest number of red neighbors,
(b) the smallest number of red neighbors and (c) the largest
number of white neighbors. Variations (a) and (c) aim at
minimizing the objects to be added in the second pass, that
is, $S^{r'} \setminus S^r$, while variation (b) aims at maximizing $S^r \cap S^{r'}$.
Algorithm 3 depicts variation (a), where $N_{r'}^R(p_i)$ denotes the
red neighbors of object $p_i$.



Figure 6: Solutions by the various diversification methods for a clustered dataset: (a) r-DisC, (b) MaxSum, (c) MaxMin, (d) k-medoids and (e) r-C. Selected objects are shown
in bold. Solid circles denote the radius r of the selected objects.



Algorithm 3 Greedy-Zoom-Out(a)

Input: A set of objects V, an initial radius r, a solution $S^r$ and
a new radius r' > r.
Output: An r'-DisC diverse subset of V.

1: $S^{r'} \leftarrow \emptyset$
2: Color all black objects red
3: Color all grey objects white
4: while there exist red objects do
5:     Select the red object $p_i$ with the largest $|N_{r'}^R(p_i)|$
6:     Color $p_i$ black
7:     $S^{r'} = S^{r'} \cup \{p_i\}$
8:     for all $p_j \in N_{r'}(p_i)$ do
9:         Color $p_j$ grey
10:    end for
11: end while
12: while there exist white objects do
13:    Select the white object $p_i$ with the largest $|N_{r'}^W(p_i)|$
14:    Color $p_i$ black
15:    $S^{r'} = S^{r'} \cup \{p_i\}$
16:    for all $p_j \in N_{r'}^W(p_i)$ do
17:        Color $p_j$ grey
18:    end for
19: end while
20: return $S^{r'}$
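
The two passes of variation (a) can be sketched as follows, again assuming Euclidean distance over coordinate tuples, brute-force neighbor sets and lowest-index tie-breaking; the name `greedy_zoom_out` and the three sample points are hypothetical.

```python
import math

def greedy_zoom_out(objects, prev_solution, new_r, dist=math.dist):
    """Adapt an r-DisC diverse subset to a larger radius r' (variation (a))."""
    n = len(objects)
    nbrs = [{j for j in range(n)
             if j != i and dist(objects[i], objects[j]) <= new_r}
            for i in range(n)]
    red = {i for i in range(n) if objects[i] in prev_solution}  # previously black
    white = set(range(n)) - red                                 # previously grey
    solution = []
    while red:    # first pass: keep red objects, drop the red ones they now cover
        best = max(sorted(red), key=lambda i: len(nbrs[i] & red))
        solution.append(objects[best])
        red.discard(best)
        red -= nbrs[best]
        white -= nbrs[best]
    while white:  # second pass: cover whatever is still uncovered
        best = max(sorted(white), key=lambda i: len(nbrs[i] & white))
        solution.append(objects[best])
        white.discard(best)
        white -= nbrs[best]
    return solution

# Hypothetical sample data: [(0, 0), (3, 0)] is a 2.5-DisC diverse subset.
points = [(0, 0), (3, 0), (5.5, 0)]
zoomed = greedy_zoom_out(points, [(0, 0), (3, 0)], new_r=3)
```

In this example the object (3, 0) is dropped because (0, 0) now covers it, which leaves (5.5, 0) uncovered and forces it into the new solution, so the result is not a subset of the previous one.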



Lemma 6. For the solution $S^{r'}$ generated by the Zoom-Out
and Greedy-Zoom-Out algorithms, it holds that:

(i) There are at most $|N^I_{r,r'}(p_i)|$ objects in $S^r \setminus S^{r'}$.

(ii) For each object of $S^r$ not included in $S^{r'}$, at most B − 1
objects are added to $S^{r'}$.

Proof. Condition (i) is a direct consequence of the definition
of $N^I_{r,r'}(p_i)$. Concerning condition (ii), recall that
each removed object $p_i$ has at most B independent neighbors
for r'. Since $p_i$ is covered by some neighbor, there are
at most B − 1 other independent objects that can potentially
enter $S^{r'}$. □

As before, objects left uncovered by objects such as $p_2$
in Figure 5(b) may already be covered by other objects in
the new solution (consider $p_4$ in our example, which is now
covered by $p_3$). However, when trying to adapt a DisC diverse
subset to a larger radius, i.e., maintain some common
objects between the two subsets, there is no theoretical guar- 
antee that the size of the new solution will be reduced. 

4. COMPARISON WITH OTHER MODELS 

The most widely used diversification models are MaxMin
and MaxSum that aim at selecting objects that maximize

$f_{Min} = \min_{p_i, p_j \in S,\, p_i \ne p_j} dist(p_i, p_j)$ and $f_{Sum} = \sum_{p_i, p_j \in S,\, p_i \ne p_j} dist(p_i, p_j)$

respectively (e.g., [16, 26, 6]). Let us first compare analytically
the quality of an r-DisC solution to the optimal values
of these metrics.

Lemma 7. Let V be a set of objects, S be an r-DisC diverse
subset of V and $\lambda$, $\lambda > r$, be the $f_{Min}$ distance between
objects of S. Let $S^*$ be an optimal MaxMin subset of V for
$k = |S|$ and $\lambda^*$ be the $f_{Min}$ distance for $S^*$. Then, $\lambda^* \le 3\lambda$.

Proof. Each object in $S^*$ is covered by (at least) one
object in S. There are two cases, either (i) all objects $p_1^*, p_2^* \in S^*$,
$p_1^* \ne p_2^*$, are covered by different objects in S, or (ii)
there are at least two objects in $S^*$, $p_1^*, p_2^*$, $p_1^* \ne p_2^*$, that are
both covered by the same object p in S. Case (i): Let $p_1$ and
$p_2$ be two objects in S such that $dist(p_1, p_2) = \lambda$ and $p_1^*$ and
$p_2^*$ respectively be the objects in $S^*$ that each covers. Then,
by applying the triangle inequality twice, we get:
$dist(p_1^*, p_2^*) \le dist(p_1^*, p_1) + dist(p_1, p_2^*) \le dist(p_1^*, p_1) + dist(p_1, p_2) + dist(p_2, p_2^*)$.
By coverage, we get: $dist(p_1^*, p_2^*) \le r + \lambda + r \le 3\lambda$,
thus $\lambda^* \le 3\lambda$. Case (ii): Let $p_1^*$ and $p_2^*$ be two objects
in $S^*$ that are covered by the same object p in S. Then, by
coverage and the triangle inequality, we get
$dist(p_1^*, p_2^*) \le dist(p_1^*, p) + dist(p, p_2^*) \le 2r$, thus $\lambda^* \le 2\lambda$. □

Lemma 8. Let V be a set of objects, S be an r-DisC diverse
subset of V and $\sigma$ be the $f_{Sum}$ distance between objects
of S. Let $S^*$ be an optimal MaxSum subset of V for $k = |S|$
and $\sigma^*$ be the $f_{Sum}$ distance for $S^*$. Then, $\sigma^* \le 3\sigma$.

Proof. We consider the same two cases for the objects
in S covering the objects in $S^*$ as in the proof of Lemma 7.
Case (i): Let $p_1^*$ and $p_2^*$ be two objects in $S^*$ and $p_1$ and $p_2$
be the objects in S that cover them respectively. Then, by
applying the triangle inequality twice, we get:
$dist(p_1^*, p_2^*) \le dist(p_1^*, p_1) + dist(p_1, p_2^*) \le dist(p_1^*, p_1) + dist(p_1, p_2) + dist(p_2, p_2^*)$.
By coverage, we get: $dist(p_1^*, p_2^*) \le 2r + dist(p_1, p_2)$ (1).
Case (ii): Let $p_1^*$ and $p_2^*$ be two objects
in $S^*$ that are covered by the same object p in S. Then,
by coverage and the triangle inequality, we get:
$dist(p_1^*, p_2^*) \le dist(p_1^*, p) + dist(p, p_2^*) \le 2r$ (2). From (1) and (2),
we get: $\sum_{p_i^*, p_j^* \in S^*,\, p_i^* \ne p_j^*} dist(p_i^*, p_j^*) \le \sum_{p_i, p_j \in S,\, p_i \ne p_j} (2r + dist(p_i, p_j))$.
From independence, $\forall p_i, p_j \in S$, $i \ne j$,
$dist(p_i, p_j) > r$. Thus, $\sigma^* \le 3\sigma$. □
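
The two objectives compared in Lemmas 7 and 8 can be evaluated for any candidate subset as sketched below. The sketch assumes Euclidean distance over coordinate tuples and counts each unordered pair once in the sum, which is an assumption about the summation convention (counting ordered pairs would double the value without changing which subset is optimal); the function names and sample points are illustrative.

```python
import math
from itertools import combinations

def f_min(subset, dist=math.dist):
    """MaxMin objective: smallest pairwise distance within the subset."""
    return min(dist(a, b) for a, b in combinations(subset, 2))

def f_sum(subset, dist=math.dist):
    """MaxSum objective: sum of pairwise distances (each pair counted once)."""
    return sum(dist(a, b) for a, b in combinations(subset, 2))

# Hypothetical sample subset forming a 3-4-5 right triangle.
selected = [(0, 0), (3, 0), (3, 4)]
smallest_gap = f_min(selected)
total_spread = f_sum(selected)
```

For any r-DisC diverse subset with at least two objects, `f_min` exceeds r by the dissimilarity condition, which is the fact both proofs rely on.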

Next, we present qualitative results of applying MaxMin
and MaxSum to a 2-dimensional "Clustered" dataset (Figure 6).
To implement MaxMin and MaxSum, we used
greedy heuristics which have been shown to achieve good solutions [10].
In addition to MaxMin and MaxSum, we also
show results for r-C diversity (i.e., covering but not necessarily
independent subsets for the given r) and k-medoids,
a widespread clustering algorithm that seeks to minimize
$\frac{1}{|V|} \sum_{p_i \in V} dist(p_i, c(p_i))$, where $c(p_i)$ is the closest object to
$p_i$ in the selected subset, since the located medoids can be
viewed as a representative subset of the dataset. To allow
for a comparison, we first run Greedy-DisC for a given r and
then use as k the size of the produced diverse subset. In this
example, k = 15 for r = 0.7.

MaxSum diversification and k-medoids fail to cover all
areas of the dataset; MaxSum tends to focus on the outskirts
of the dataset, whereas k-medoids reports only central
points, ignoring outliers. MaxMin performs better in this
aspect. However, since MaxMin seeks to retrieve objects
that are as far apart as possible, it fails to retrieve objects
from dense areas; see, for example, the center areas of the
clusters in Figure 6. DisC gives priority to such areas and,
thus, such areas are better represented in the solution. Note
also that MaxSum and k-medoids may select duplicate objects
while DisC and MaxMin do not. We also experimented
with variations of MaxSum proposed in [26] but the results
did not differ substantially from the ones in Figure 6(b). For
r-C diversity, the resulting selected set is one object smaller;
however, the selected objects are less widely spread than in
DisC. Finally, note that we are not interested in retrieving
representative subsets that follow the same distribution
as the input dataset, as in the case of sampling, since such
subsets will tend to ignore outliers. Instead, we want to
cover the whole dataset and provide a complete view of all
its objects, including the distant ones.

5. IMPLEMENTATION 

Since a central operation in computing DisC diverse sub- 
sets is locating neighbors, we introduce algorithms that ex- 
ploit a spatial index structure, namely, the M-tree [21]. An 
M-tree is a balanced tree index that can handle large vol- 
umes of dynamic data of any dimensionality in general met- 
ric spaces. In particular, an M-tree partitions space around 
some of the indexed objects, called pivots, by forming a 
bounding ball region of some covering radius around them. 
Let c be the maximum node capacity of the tree. Internal
nodes have at most c entries, each containing a pivot object
$p_v$, the covering radius $r_v$ around $p_v$, the distance of $p_v$ from
its parent pivot and a pointer to the subtree $t_v$. All objects
in the subtree $t_v$ rooted at $p_v$ are within distance at most
equal to the covering radius $r_v$ from $p_v$. Leaf nodes have
entries containing the indexed objects and their distance from
their parent pivot.

The construction of an M-tree is influenced by the split- 
ting policy that determines how nodes are split when they 
exceed their maximum capacity c. Splitting policies indicate 
(i) which two of the c + 1 available pivots will be promoted 
to the parent node to index the two new nodes (promote 
policy) and (ii) how the rest of the pivots will be assigned to 
the two new nodes (partition policy). These policies affect 
the overlap among nodes. For computing diverse subsets: 

(i) We link together all leaf nodes. This allows us to visit 
all objects in a single left-to-right traversal of the leaf 
nodes and exploit some degree of locality in covering 
the objects. 

(ii) To compute the neighbors $N_r(p_i)$ of an object $p_i$ at
radius r, we perform a range query centered around
$p_i$ with distance r, denoted $Q(p_i, r)$. Range queries
can be performed either in a top-down fashion starting
from the root node or in a bottom-up fashion starting
from $p_i$. We consider both variations.

(iii) We build trees using splitting policies that minimize
overlap. In most cases, the policy that resulted in the
lowest overlap was (a) promoting as new pivots the
pivot $p_i$ of the overflowed node and the object $p_j$ with
the maximum distance from $p_i$ and (b) partitioning
the objects by assigning each object to the node whose
pivot is closest to the object. We call
this policy "MinOverlap".
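The promote and partition steps of the "MinOverlap" policy can be sketched as follows. This is only an illustration under stated assumptions: objects are Euclidean points given as tuples, the list `objects` contains all c + 1 entries of the overflowed node (including its pivot), and the function name `min_overlap_split` and the returned triples are hypothetical, not part of any M-tree library.

```python
import math


def min_overlap_split(old_pivot, objects, dist=math.dist):
    """Split the entries of an overflowed node into two new nodes.

    Returns two (pivot, covering_radius, members) triples.
    """
    # (a) Promote: keep the pivot of the overflowed node and promote
    # the object that lies farthest away from it.
    new_pivot = max(objects, key=lambda o: dist(old_pivot, o))

    # (b) Partition: assign every object to the closest of the two pivots
    # (ties go to the old pivot).
    left, right = [], []
    for o in objects:
        if dist(o, old_pivot) <= dist(o, new_pivot):
            left.append(o)
        else:
            right.append(o)

    # The covering radius of each new node is the distance from its pivot
    # to its farthest member (0.0 for an empty node).
    r_left = max((dist(old_pivot, o) for o in left), default=0.0)
    r_right = max((dist(new_pivot, o) for o in right), default=0.0)
    return (old_pivot, r_left, left), (new_pivot, r_right, right)


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (9, 0), (10, 0)]
    print(min_overlap_split((0, 0), pts))
```

Keeping the two pivots as far apart as possible and assigning objects to the nearest pivot tends to produce small, well-separated covering balls, which is why this combination is expected to lower the overlap.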

5.1 Computing Diverse Subsets 

The Basic-DisC algorithm selects white objects in ran- 
dom order. The M-tree implementation of Basic-DisC al- 
lows us to consider objects in the order they appear in the 
leaves of the M-tree, thus taking advantage of locality. Upon 
encountering a white object pi, the algorithm colors it black 
and executes a range query Q(pi,r) to retrieve the neigh- 
bors of pi and color them grey. If the overlap among nodes is 
small, the neighbors of an indexed object are expected to re- 
side in nearby leaf nodes, thus such range queries are in gen- 
eral efficient. We can visualize the progress of Basic-DisC 
as gradually coloring all objects in the leaf nodes from left- 
to-right until all objects are either grey or black. 
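The coloring scheme of Basic-DisC can be sketched without an index. In the minimal sketch below, a linear scan stands in for the range query $Q(p_i, r)$ and the input order of the list stands in for the left-to-right order of the M-tree leaves; both substitutions are assumptions made for brevity, and the function name `basic_disc` is illustrative.

```python
import math

WHITE, GREY, BLACK = 0, 1, 2


def basic_disc(objects, r, dist=math.dist):
    """Return the indices of an r-DisC diverse subset of `objects`.

    Objects are visited in list order; a brute-force scan replaces the
    M-tree range query.
    """
    color = [WHITE] * len(objects)
    selected = []
    for i, p in enumerate(objects):
        if color[i] != WHITE:
            continue              # already covered by a selected object
        color[i] = BLACK          # select p
        selected.append(i)
        for j, q in enumerate(objects):
            if color[j] == WHITE and dist(p, q) <= r:
                color[j] = GREY   # q is now covered by p
    return selected


if __name__ == "__main__":
    pts = [(0.0,), (0.5,), (1.2,), (1.4,), (3.0,)]
    print(basic_disc(pts, 1.0))
```

Because an object can only be selected while it is still white, no two selected objects lie within r of each other, and the loop ends only when every object is black or grey, so the output is both independent and covering.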

We make the following observation that allows us to fur- 
ther prune subtrees while executing range queries. Objects 
that are already grey do not need to be colored grey again 
when some other of their neighbors is colored black. 

Pruning Rule: A leaf node that contains no white objects 
is colored grey. When all its children become grey, an inter- 
nal node is colored grey. While executing range queries, the 
top-down search of the tree does not need to follow subtrees 
rooted at grey nodes. 

As the algorithm progresses, more and more nodes become 
grey, and thus, the cost of range queries reduces over time. 
We call this variation Basic-DisC (Pruned). We can visu- 
alize its progress as gradually coloring all tree nodes grey in 
a post-order manner. 
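A toy version of the pruning rule is sketched below. It assumes a simplified two-level ball tree rather than a real M-tree, hashable and distinct point tuples, and a grey flag that can be checked without counting a node access (as if it were stored in the parent entry). The class `Node` and the functions are hypothetical names, and refreshing the flags by a full tree walk is done only for simplicity.

```python
import math

WHITE, GREY, BLACK = "white", "grey", "black"


class Node:
    def __init__(self, pivot, radius, children=(), objects=()):
        self.pivot, self.radius = pivot, radius
        self.children = list(children)  # entries of an internal node
        self.objects = list(objects)    # points stored in a leaf
        self.grey = False


def range_query(node, q, r, color, stats):
    """Return white objects within r of q, skipping grey subtrees."""
    if node.grey or math.dist(node.pivot, q) > node.radius + r:
        return []
    stats["accesses"] += 1
    if node.objects:  # leaf node
        return [o for o in node.objects
                if color[o] == WHITE and math.dist(o, q) <= r]
    return [o for c in node.children
            for o in range_query(c, q, r, color, stats)]


def refresh_grey(node, color):
    """A leaf is grey if it has no white object; an internal node is
    grey if all of its children are grey."""
    if node.objects:
        node.grey = all(color[o] != WHITE for o in node.objects)
    else:
        flags = [refresh_grey(c, color) for c in node.children]
        node.grey = all(flags)
    return node.grey


def basic_disc_pruned(root, leaves, r):
    color = {o: WHITE for leaf in leaves for o in leaf.objects}
    stats = {"accesses": 0}
    selected = []
    for leaf in leaves:                  # left-to-right leaf traversal
        for p in leaf.objects:
            if color[p] != WHITE:
                continue
            color[p] = BLACK
            selected.append(p)
            for o in range_query(root, p, r, color, stats):
                color[o] = GREY
            refresh_grey(root, color)
    return selected, stats["accesses"]
```

In this sketch the second range query never enters the first leaf, because that leaf turned grey after the first selection; this is the saving that the pruning rule aims for.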

The Greedy-DisC algorithm selects at each iteration the
white object with the largest white neighborhood. To efficiently
implement Greedy-DisC, we maintain all white objects
in a structure L' sorted by the size of their white
neighborhood. For initializing L', we need to compute the
size of the white neighborhoods of all objects. Initially,
$N_r^W(p_i) = N_r(p_i)$ for every object $p_i$. We opt to compute
the neighborhood size of each object as we build the M-tree.
When an object $p_i$ is inserted, a range query $Q(p_i, r)$ is
executed. The white neighborhood of $p_i$ is initialized to
$|Q(p_i, r)|$ and the white neighborhoods of all objects
retrieved by the range query are incremented by one. We
found that computing the size of neighborhoods while building
the tree reduces node accesses by up to 45%.

Table 1: Input parameters.

Parameter                 Default value   Range
M-tree node capacity      50              25 - 100
M-tree splitting policy   MinOverlap      various
Dataset cardinality       10000           579 - 50000
Dataset dimensionality    2               2 - 10
Dataset distribution      normal          uniform, normal
Distance metric           Euclidean       Euclidean, Hamming

Table 2: Solution size for the Basic-DisC, Greedy-C
and the variations of Greedy-DisC algorithms.

At each iteration of the algorithm, the first element $p_i$ of
L' is selected and colored black and a range query $Q(p_i, r)$
is used to retrieve and color grey the neighbors of $p_i$. We
also need to update the size of the white neighborhoods
of all affected objects, i.e., all objects in the neighborhood
of each $p_j$ in $N_r(p_i)$. We consider two variations. The
first variation, termed Grey-Greedy-DisC, executes an
additional range query $Q(p_j, r)$ for each of the newly colored
grey neighbors $p_j$ of $p_i$ for locating the neighbors of $p_j$ and
reducing by one the size of their white neighborhood. The
second variation, termed White-Greedy-DisC, executes one
range query for all remaining white objects with distance
less than or equal to 2r from $p_i$. These are the only white
objects for which the size of their white neighborhood may
have changed. As before, we can reduce the cost using the
pruning rule. We call these variations Grey-Greedy-DisC
(Pruned) and White-Greedy-DisC (Pruned).
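The bookkeeping behind Greedy-DisC can be illustrated with a small in-memory sketch. It follows the Grey-Greedy style of update (one pass over the neighbors of every newly covered object), but it assumes precomputed brute-force neighbor lists instead of M-tree range queries and a linear scan instead of the sorted structure L'. The name `greedy_disc` and the tie-breaking rule (smallest index first) are assumptions.

```python
import math


def greedy_disc(objects, r, dist=math.dist):
    """Return indices of an r-DisC diverse subset chosen greedily."""
    n = len(objects)
    # Brute-force neighbor lists stand in for range queries Q(p, r).
    nbrs = [[j for j in range(n)
             if j != i and dist(objects[i], objects[j]) <= r]
            for i in range(n)]
    white = set(range(n))
    white_count = [len(nb) for nb in nbrs]  # size of white neighborhood
    selected = []
    while white:
        # White object with the largest white neighborhood
        # (ties broken by the smallest index).
        p = max(white, key=lambda i: (white_count[i], -i))
        selected.append(p)
        newly_covered = [p] + [j for j in nbrs[p] if j in white]
        white.difference_update(newly_covered)
        # Every object that just stopped being white reduces the white
        # neighborhood of each of its neighbors by one.
        for j in newly_covered:
            for k in nbrs[j]:
                white_count[k] -= 1
    return selected


if __name__ == "__main__":
    print(greedy_disc([(0,), (1,), (2,)], 1))   # a single object suffices
```

On the three collinear points in the usage line, visiting objects in input order would select the two end points, whereas the greedy choice selects only the middle one.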

Finally, Greedy-C considers both grey and white objects 
as candidates. A sorted structure L' has to be maintained 
as well, which now includes both white and grey objects 
and is substantially larger. Furthermore, the pruning rule 
is no longer useful, since grey objects and nodes need to be 
accessed again for updating the size of their white neighbor- 
hood. We use the pruning rule to introduce a faster heuristic 
for computing r-C diverse subsets, called Fast-C. Whenever 
an object pi is colored black, Fast-C executes a range query 
to retrieve its neighbors by traversing the tree bottom-up. 
The query stops climbing up the tree when the first grey 
internal node is met. This may lead to failing to reach and 
thus color grey some of the neighbors of pi that reside in dis- 
tant leaf nodes and thus produce larger results. However, 
in a tree with small overlap, we expect the number of such 
neighbors to be small. 

5.2 Adapting the Radius 

For zooming-in, given an r-DisC diverse subset $S^r$ of $P$,
we would like to compute an r'-DisC diverse subset $S^{r'}$ of
$P$, r' < r, such that $S^{r'} \supseteq S^r$. A naive implementation
would require two range queries per object in $S^r$ plus any
additional range queries required to cover newly uncovered
areas. During the construction of $S^r$, however, objects in
the corresponding M-tree are already colored black or grey.
We can exploit this information to efficiently implement the
Zoom-In and Greedy-Zoom-In algorithms.

Zooming Rule: Black objects of $S^r$ maintain their color in
$S^{r'}$. Grey objects maintain their color as long as there exists
a black object at distance at most r' from them. Therefore,
only grey objects with no black neighbors at distance r' may
turn black and enter $S^{r'}$.

To take advantage of the Zooming Rule, we extend the
leaf nodes of the M-tree to include the distance of the
indexed object $p_i$ to its closest black neighbor $p_j$, since $p_i$ will
continue to be covered by $p_j$ for all $r' \geq dist(p_i, p_j)$.

(a) Uniform (2D - 10000 objects).

r             0.01   0.02   0.03   0.04   0.05   0.06   0.07
B-DisC        3839   1360   676    411    269    192    145
G-DisC        3260   1120   561    352    239    176    130
L-Gr-G-DisC   3384   1254   630    378    253    184    137
L-Wh-G-DisC   3293   1152   589    352    240    170    130
G-C           3427   1104   541    338    230    170    126

(b) Clustered (2D - 10000 objects).

r             0.01   0.02   0.03   0.04   0.05   0.06   0.07
B-DisC        1018   370    193    121    80     61     48
G-DisC        892    326    162    102    69     52     43
L-Gr-G-DisC   680    394    218    133    87     64     49
L-Wh-G-DisC   906    313    168    104    70     52     41
G-C           895    322    166    102    71     50     43

(c) Cities.

(d) Cameras.

r             1      2      3      4      5      6
B-DisC        461    237    103    34     9      4
G-DisC        461    212    78     28     9      2
L-Gr-G-DisC   461    216    80     31     9      2
L-Wh-G-DisC   461    …      74     25     6      …

The Zoom-In algorithm requires one pass of the leaf nodes.
Each time a grey object $p_i$ is encountered, its distance from
its closest black neighbor is compared against the new radius
r'. In case this distance is larger than r', $p_i$ is colored black
and a range query $Q(p_i, r')$ is executed to locate any objects
for which $p_i$ is now the closest black neighbor and color
them grey. At the end of the pass, the black objects of the
leaves form $S^{r'}$. Greedy-Zoom-In also requires maintenance
of a sorted structure L'. First, the leaf nodes are traversed, and
grey objects that are now uncovered are colored white and
inserted into L'. Then, the white neighborhoods of objects
in L' are computed and L' is sorted accordingly. Finally, the
first object $p_i$ of L' is retrieved and colored black; a range
query $Q(p_i, r')$ is executed and retrieved objects are colored
grey. Black and grey objects are removed from L'. This
process is repeated until L' is empty.
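The single-pass Zoom-In procedure can be sketched as follows. The sketch assumes that the previous solution is given as a non-empty list of indices, recomputes the distance to the closest black neighbor by brute force instead of reading it from the extended leaf entries, and uses a linear scan in place of the range query $Q(p_i, r')$; `zoom_in` is an illustrative name.

```python
import math


def zoom_in(objects, black, r_new, dist=math.dist):
    """Adapt an r-DisC diverse subset to a smaller radius r_new.

    `black` holds the indices of the previous solution; the result keeps
    all of them and appends the objects that had to be added.
    """
    black = list(black)
    is_black = set(black)
    # Distance of every object to its closest black neighbor.
    closest = [min(dist(p, objects[b]) for b in black) for p in objects]
    for i, p in enumerate(objects):
        if i in is_black or closest[i] <= r_new:
            continue                 # still covered under the new radius
        is_black.add(i)              # uncovered grey object turns black
        black.append(i)
        closest[i] = 0.0
        for j, q in enumerate(objects):
            d = dist(p, q)
            if d <= r_new and d < closest[j]:
                closest[j] = d       # p is now the closest black neighbor
    return black


if __name__ == "__main__":
    pts = [(0.0,), (0.4,), (0.9,), (2.0,)]
    print(zoom_in(pts, [0, 3], 0.5))
```

The returned list always starts with the previous solution, which mirrors the requirement that the zoomed-in subset is a superset of the original one.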

Note that our pruning technique during the construction
of $S^r$ interferes with the correct computation of the
distances to the closest black neighbor of the objects. Therefore,
an additional post-processing step is required after the
construction of $S^r$ to compute these distances.

Zooming-out algorithms are implemented similarly. A
sorted structure L' is employed at the first pass of the greedy
variations to process red objects in the desired order, while
the same structure is used at the second step to process
white objects. Finally, for local zooming in an object $p_i$,
the only difference is that instead of all objects in $P$, the
algorithm receives as input only the objects in $N_r(p_i)$.

6. EXPERIMENTAL EVALUATION 

In this section, we evaluate the performance of our algorithms
using both synthetic and real datasets. Our synthetic
datasets consist of multidimensional objects, where values
at each dimension are in [0, 1]. Objects are either uniformly
distributed in space ("Uniform") or form (hyper)spherical
clusters of different sizes ("Clustered"). We also employ
two real datasets. The first one ("Cities") is a collection
of 2-dimensional points representing geographic information
about 5922 cities and villages in Greece [2]. We normalized
the values of this dataset in [0, 1]. The second real dataset
("Cameras") consists of 7 characteristics, such as brand and
storage type, for 579 digital cameras.

We use the Euclidean distance for the synthetic datasets
and "Cities", while for "Cameras", whose attributes are
categorical, we use $dist(p_i, p_j) = \sum_{l} \delta_l(p_i, p_j)$, where $\delta_l(p_i, p_j)$
is equal to 1 if $p_i$ and $p_j$ differ in the $l$-th dimension and
0 otherwise, i.e., the Hamming distance. Note that the choice
of an appropriate distance metric is an important issue, but
one that is orthogonal to our approach.

Table 1 summarizes the values of the input parameters
used in our experiments.

Solution Size and Computational Cost: We first compare
our various algorithms in terms of the size of the computed
diverse subset and the cost of its computation. The
computational cost is measured in terms of node accesses.
We consider Basic-DisC (B-DisC), the two variations of greedy,
namely Grey-Greedy-DisC (Gr-G-DisC) and White-Greedy-DisC
(Wh-G-DisC), as well as Greedy-C (G-C). We also tested
two "lazy" variations of the greedy algorithms, where, to
reduce the cost, we do not update the white neighborhoods of
those objects that are further away from the newly colored
objects. To this end, we perform range queries with a radius
smaller than r for each grey object (L-Gr-G-DisC) and smaller
than 2r for each black object (L-Wh-G-DisC). We use the values
r/2 and 3r/2, respectively.

Table 2 shows the solution size for different radii. Figure 7
reports the cost for B-DisC, Gr-G-DisC and G-C and
the cost savings when the pruning rule of Section 5 is employed
for Basic-DisC and Grey-Greedy-DisC (as previously
detailed, this pruning cannot be applied to Greedy-C).
Grey-Greedy-DisC locates a smaller DisC diverse subset than
Basic-DisC in all cases. This, however, has the trade-off
of increased computational cost. The additional computational
cost becomes more significant as the radius increases.
The reason for this is that Grey-Greedy-DisC performs
significantly more range queries than Basic-DisC. As the radius
increases, objects have more neighbors and, thus, more
M-tree nodes need to be accessed in order to retrieve them,
color them and update the size of the neighborhoods of their
neighbors. On the contrary, the cost of Basic-DisC is reduced
when the radius increases, since it does not need to
update the size of any neighborhood. For larger radii, more
objects are colored grey by each selected (black) object and,
therefore, fewer range queries are performed. Greedy-C has
similar behavior with Grey-Greedy-DisC in terms of solution
size. This means that relaxing the independence requirement
does not always lead to smaller diverse subsets as one
might expect. Note that the diverse subsets computed by
all heuristics for the "Clustered" dataset are smaller than for
the "Uniform" one, since objects are generally more similar
to each other. Both heuristics benefit from pruning (up to
50% for small radii).

Figure 7: Node accesses for Basic-DisC, Grey-Greedy-DisC and Greedy-DS
with and without pruning. Panels: (a) Uniform, (b) Clustered, (c) Cities, (d) Cameras.

Figure 8: Node accesses for Basic-DisC and all variations of Greedy-DisC
with pruning. Panels: (a) Uniform, (b) Clustered, (c) Cities, (d) Cameras.

We also experimented with employing bottom-up rather
than top-down range queries. In most cases, the benefit
in node accesses was less than 5%. The Fast-C heuristic
described in Section 5 required up to 30% fewer node accesses
than Greedy-C, while computing similar sized solutions.
However, the solutions had a larger percentage of
independent objects than those of Greedy-C. We omit the
relative figures due to space limitations.

Figure 8 compares Grey-Greedy-DisC with White-Greedy-DisC
and their corresponding "lazy" variations. We see that
White-Greedy-DisC performs better than Grey-Greedy-DisC
for the clustered dataset as r increases. This is because in
this case, grey objects share many common white neighbors
which are accessed multiple times by Grey-Greedy-DisC for
updating their white neighborhood size and only once by
White-Greedy-DisC. The lazy variations can further reduce
the computational cost of the heuristics with the trade-off
of slightly larger solution sizes (Table 2).

In the rest of this section, unless otherwise noted, we use
the (Grey-)Greedy-DisC (Pruned) heuristic.

Impact of Dataset Cardinality and Dimensionality:
For this experiment, we employ the "Clustered" dataset and
vary its cardinality from 5000 to 15000 objects and its
dimensionality from 2 to 10 dimensions. Figure 9 shows the
corresponding solution size and computational cost as computed
by the Greedy-DisC heuristic. We observe that the
solution size is more sensitive to changes in cardinality when
the radius is small. The reason for this is that for large
radii, a selected object covers a large area in space. Therefore,
even when the cardinality increases and there are many
available objects to choose from, these objects are quickly
covered by the selected ones. In Figure 9(b), the increase
in the computational cost is due to the increase of range
queries required to maintain correct information about the
size of the white neighborhoods.

Increasing the dimensionality of the dataset causes more
objects to be selected as diverse, as shown in Figure 9(c).
This is due to the "curse of dimensionality" effect, since
space becomes sparser at higher dimensions. The computational
cost may however vary as dimensionality increases,
since it is influenced by the cost of computing the neighborhood
size of the objects that are colored grey.

Figure 9: Varying (a)-(b) cardinality and (c)-(d) dimensionality.
Panels: (a) Solution size, (b) Node accesses, (c) Solution size, (d) Node accesses.

Impact of M-tree Characteristics: Next, we evaluate 
how the characteristics of the employed M-trees affect the 
computational cost of computed DisC diverse subsets. Note 
that, different tree characteristics do not have an impact on 
which objects are selected as diverse. 

Different degrees of overlap among the nodes of an M-tree
may affect its efficiency for executing range queries. To
quantify such overlap, we employ the fat-factor [24] of the
tree, defined as:

$f(T) = \frac{Z - n h}{n} \cdot \frac{1}{m - h}$

where Z denotes the total number of node accesses required 
to answer point queries for all objects stored in the tree, n 
the number of these objects, h the height of the tree and 
m the number of nodes in the tree. Ideally, the tree would 
require accessing one node per level for each point query 
which yields a fat-factor of zero. The worst tree would visit 
all nodes for every point query and its fat-factor would be 
equal to one. 
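The two boundary cases described above can be checked numerically. The sketch below assumes the usual form of the fat-factor, $(Z - n h) / (n (m - h))$, which is consistent with a value of zero for the ideal tree and one for the worst tree; the function name and argument names are illustrative.

```python
def fat_factor(total_accesses, n_objects, height, n_nodes):
    """Fat-factor of a tree, assuming f = (Z - n*h) / (n * (m - h)).

    total_accesses: Z, node accesses needed to answer a point query
                    for every stored object
    n_objects:      n, number of stored objects
    height:         h, height of the tree
    n_nodes:        m, number of nodes in the tree
    """
    if n_nodes <= height:
        raise ValueError("the tree must have more nodes than levels")
    return (total_accesses - n_objects * height) / (
        n_objects * (n_nodes - height))


if __name__ == "__main__":
    n, h, m = 100, 3, 13
    print(fat_factor(n * h, n, h, m))   # one node per level per query
    print(fat_factor(n * m, n, h, m))   # every node visited per query
```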

We created various M-trees using different splitting poli- 
cies which result in different fat-factors. We present results 
for four different policies. The lowest fat-factor was ac- 
quired by employing the "MinOverlap" policy. Selecting as 
new pivots the two objects with the greatest distance from 
each other resulted in increased fat-factor. Even higher fat- 
factors were observed when assigning an equal number of 
objects to each new node (instead of assigning each object 
to the node with the closest pivot) and, finally, selecting 
the new pivots randomly produced trees with the highest 
fat-factor among all policies. 

Figure 10 reports our results for our uniform and clustered
2-dimensional datasets with cardinality equal to 10000. For
the uniform dataset, we see that a high fat-factor leads to
more node accesses being performed for the same solution.
This is not the case for the clustered dataset, where objects
are gathered in dense areas and thus increasing the fat-factor
does not have the same impact as in the uniform case, due to
pruning and locality. As the radius of the computed subset
becomes very large, the solution size becomes very small,
since a single object covers almost the entire dataset. This is
why all lines of Figure 10 begin to converge for r > 0.7.

Figure 10: Varying M-tree fat-factor.

We also experimented with varying the capacity of the 
nodes of the M-tree. Trees with smaller capacity require 
more node accesses since more nodes need to be recovered 
to locate the same objects; when doubling the node capacity, 
the computational cost was reduced by almost 45%. 

Zooming: In the following, we evaluate the performance
of our zooming algorithms. We begin with the zooming-in
heuristics. To do this, we first generate solutions with
Greedy-DisC for a specific radius r and then adapt these
solutions for radius r'. We use Greedy-DisC because it gives
the smallest sized solutions. We compare the results to the
solutions generated from scratch by Greedy-DisC for the
new radius. The comparison is made in terms of solution
size, computational cost and also the relation of the three
solutions as measured by the Jaccard distance. Figure 11
and Figure 12 report the corresponding results for different
radii. Due to space limitations, we report results for
the "Clustered" and "Cities" datasets. Similar results are
obtained for the other datasets as well. Each solution
reported for the zooming-in algorithms is adapted from the
Greedy-DisC solution for the immediately larger radius and,
thus, the x-axis is reversed for clarity; e.g., the zooming
solutions for r = 0.02 in Figure 11(a) and Figure 12(a) are
adapted from the Greedy-DisC solution for r = 0.03.
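For reference, the Jaccard distance between two solutions can be computed as one minus the ratio of common objects to all objects appearing in either solution. The sketch below assumes solutions are given as collections of hashable object identifiers and treats two empty solutions as identical; the function name is illustrative.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two solutions: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty solutions are identical
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    previous = {1, 2, 3}
    adapted = {1, 2, 3, 4}     # zooming keeps the old objects
    scratch = {5, 6, 3, 7}     # recomputation shares few objects
    print(jaccard_distance(previous, adapted))
    print(jaccard_distance(previous, scratch))
```

A small distance to the previous solution means that the user sees mostly the same objects after changing the radius.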

We observe that the zooming-in heuristics provide similar
solution sizes with Greedy-DisC in most cases, while their
computational cost is smaller, even for Greedy-Zoom-In.
More importantly, the Jaccard distance of the adapted
solutions for r' to the Greedy-DisC solution for r is much smaller
than the corresponding distance of the Greedy-DisC solution
for r' (Figure 13). This means that computing a new
solution for r' from scratch changes most of the objects
returned to the user, while a solution computed by a zooming-in
heuristic maintains many common objects in the new
solution. Therefore, the new diverse subset is intuitively closer
to what the user expects to receive.

Figure 11: Solution size for zooming-in.

Figure 12: Node accesses for zooming-in.

Figure 14 and Figure 15 show corresponding results for the
zooming-out heuristics. The Greedy-Zoom-Out (c) heuristic
achieves the smallest adapted DisC diverse subsets. However,
its computational cost is very high and generally exceeds
the cost of computing a new solution from scratch.
Greedy-Zoom-Out (a) achieves similar solution sizes to
Greedy-Zoom-Out (c), while its computational cost is much
lower. The non-greedy heuristic has the lowest computational
cost. Again, all the Jaccard distances of the zooming-out
heuristics to the previously computed solution are smaller
than that of Greedy-DisC (Figure 16), which indicates that
a solution computed from scratch has only a few objects in
common with the initial DisC diverse set.

7. RELATED WORK 

Other Diversity Definitions: Diversity has recently at- 
tracted a lot of attention as a means of enhancing user sat- 
isfaction [2H1 [H EH [5] ■ Diverse results have been defined in 
various ways [10], namely in terms of content (or similarity),
novelty and semantic coverage. Similarity definitions (e.g.,
[30]) interpret diversity as an instance of the p-dispersion
problem [13] whose objective is to choose p out of n given
points, so that the minimum distance between any pair of
chosen points is maximized. Our approach differs in that
the size of the diverse subset is not an input parameter.

Figure 14: Solution size for zooming-out.

Figure 15: Node accesses for zooming-out.

Most current novelty and semantic coverage approaches to 
diversification (e.g., [51 151 [251 114] ) rely on associating a di- 
versity score with each object in the result and then either 
selecting the top-k highest ranked objects or those objects
whose score is above some threshold. Such diversity scores 
are hard to interpret, since they do not depend solely on 
the object. Instead, the score of each object is relative to 
which objects precede it in the rank. Our approach is fun- 
damentally different in that we treat the result as a whole 
and select DisC diverse subsets of it that fully cover it. 

Another related work is that of [18] that extends near- 
est neighbor search to select k neighbors that are not only 
spatially close to the query object but also differ on a set of 
predefined attributes above a specific threshold. Our work is 
different since our goal is not to locate the nearest and most 
diverse neighbors of a single object but rather to locate an 
independent and covering subset of the whole dataset. 

The problem of diversifying continuous data has been re- 
cently considered in [12U211I2U] using a number of variations 
of the MaxMin and MaxSum diversification models. 

Finally, another related method for selecting representative
results, besides diversity-based ones, is k-medoids, since
medoids can be viewed as representative objects (e.g., [19]).
However, medoids may not cover all the available space. 
Medoids were extended in [5] to include some sense of rele- 
vance (priority medoids). 
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Figure 13: Jaccard distance for zooming-in. 
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Figure 16: Jaccard distance for zooming-out. 



Results from Graph Theory: The properties of independent
and dominating (or covering) subsets have been
extensively studied. A number of different variations exist.
Among these, the Minimum Independent Dominating
Set Problem (which is equivalent to the r-DisC diverse
subset problem) has been shown to have some of the strongest
negative approximation results: in the general case, it cannot
be approximated in polynomial time within a factor of $n^{1-\epsilon}$
for any $\epsilon > 0$ unless P = NP [17]. However, some
approximation results have been found for special graph cases, such
as bounded degree graphs [7]. In our work, rather than
providing polynomial approximation bounds for DisC diversity,
we focus on the efficient computation of non-minimum but
small DisC diverse subsets. There is a substantial amount
of related work in the field of wireless networks research,
since a Minimum Connected Dominating Set of wireless
nodes can be used as a backbone for the entire network [23].
Allowing the dominating set to be connected has an impact
on the complexity of the problem and allows different
algorithms to be designed.

8. SUMMARY AND FUTURE WORK 

In this paper, we proposed a novel, intuitive definition of
diversity as the problem of selecting a minimum representative
subset S of a result P, such that each object in P is
represented by a similar object in S and that the objects
included in S are not similar to each other. Similarity is
modeled by a radius r around each object. We call such
subsets r-DisC diverse subsets of P. We introduced adaptive
diversification through decreasing r, termed zooming-in,
and increasing r, called zooming-out. Since locating
minimum r-DisC diverse subsets is an NP-hard problem, we
introduced heuristics for computing approximate solutions,
including incremental ones for zooming, and provided
corresponding theoretical bounds. We also presented an efficient
implementation based on spatial indexing.

There are many directions for future work. We are cur- 
rently looking into two different ways of integrating rel- 
evance with DisC diversity. The first approach is by a 
"weighted" variation of the DisC subset problem, where each 
object has an associated weight based on its relevance. Now 
the goal is to select a DisC subset having the maximum sum 
of weights. The other approach is to allow multiple radii, 
so that relevant objects get a smaller radius than the ra- 
dius of less relevant ones. Other potential future directions 
include implementations using different data structures and 
designing algorithms for the online version of the problem. 
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Appendix 

Proof of Lemma 3: Let $p_1$, $p_2$ be two independent neighbors
of $p$. Then, it must hold that $\angle p_1 p p_2$ (in the Euclidean
space) is larger than $\pi/4$. We will prove this using
contradiction. $p_1$, $p_2$ are neighbors of $p$ so they must reside in
the shaded area of Figure 17(a). Without loss of generality,
assume that one of them, say $p_1$, is aligned to the vertical
axis. Assume that $\angle p_1 p p_2 \leq \pi/4$. Then $\cos(\angle p_1 p p_2) \geq \sqrt{2}/2$.
It holds that $b \leq r$ and $c \leq r$, thus, using the cosine law we
get that $a^2 \leq r^2 (2 - \sqrt{2})$ (1). The Manhattan distance of
$p_1$, $p_2$ is equal to $x + y = \sqrt{a^2 + 2xy}$ (2). Also, the following
hold: $x = \sqrt{b^2 - z^2}$, $y = c - z$ and $z = b \cos(\angle p_1 p p_2) \geq b\sqrt{2}/2$.
Substituting $z$ and $c$ in the first two equations, we get
$x \leq b\sqrt{2}/2$ and $y \leq r - b\sqrt{2}/2$. From (1), (2) we now get that
$x + y \leq r$, which contradicts the independence of $p_1$ and $p_2$.
Therefore, $p$ can have at most $(2\pi / (\pi/4)) - 1 = 7$ independent
neighbors.

Proof of Theorem [2} We consider that inserting a node u 
into S has cost 1. We distribute this cost equally among all 
covered nodes, i.e., after being labeled grey, nodes are not 
charged anymore. Assume an optimal minimum dominating 
set S* . The graph G can be decomposed into a number of 
star-shaped subgraphs, each of which has one node from S* 
at its center. The cost of an optimal minimum dominating 
set is exactly 1 for each star-shaped subgraph. We show 
that for a non-optimal set S, the cost for each star-shaped 
subgraph is at most In A, where A is the maximum degree 
of the graph. Consider a star-shaped subgraph of S* with u 
at its center and let (u) be the number of white nodes 
in it. If a node in the star is labeled grey by Greedy-DS, 
these nodes are charged some cost. By the greedy condition 
of the algorithm, this cost can be at most l/|Af^(M)| per 
newly covered node. Otherwise, the algorithm would rather 
have chosen u for the dominating set because u would cover 



at least | TV,^ | nodes. In the worst case, no two nodes 
in the star of u are covered at the same iteration. In this 
case, the first node that is labeled grey is charged at most 
l/(<5(u) + 1), the second node is charged at most l/8(u) and 
so on, where 5(u) is the degree of u. Therefore, the total 
cost of a star is at most: 

$$\frac{1}{\delta(u)+1} + \frac{1}{\delta(u)} + \dots + \frac{1}{2} + 1 = H(\delta(u)+1) \le H(\Delta+1) \approx \ln \Delta$$

where $H(i)$ is the $i$-th harmonic number. Since a minimum
dominating set is no larger than a minimum independent
dominating set, the theorem holds.
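
The charging argument can be illustrated on a toy graph. The sketch below is an illustration under stated assumptions rather than the implementation evaluated in the paper: it assumes Greedy-DS picks, at each step, the node whose closed neighborhood contains the most white nodes (ties broken by smallest identifier), and it finds the optimum by brute force, which is only feasible for very small graphs. All function names and the example graph are hypothetical.

```python
from itertools import combinations


def greedy_dominating_set(adj):
    """Repeatedly pick the node that covers the most white (uncovered) nodes."""
    white = set(adj)
    chosen = []
    while white:
        # Closed neighborhood of v, restricted to nodes that are still white.
        best = max(sorted(adj), key=lambda v: len(({v} | adj[v]) & white))
        chosen.append(best)
        white -= {best} | adj[best]
    return chosen


def minimum_dominating_set_size(adj):
    """Brute-force optimum; exponential, for tiny graphs only."""
    nodes = sorted(adj)
    for k in range(1, len(nodes) + 1):
        for cand in combinations(nodes, k):
            covered = set(cand)
            for v in cand:
                covered |= adj[v]
            if len(covered) == len(nodes):
                return k
    return 0


def harmonic(n):
    """n-th harmonic number H(n)."""
    return sum(1.0 / i for i in range(1, n + 1))


# Example: a path on 7 nodes (maximum degree 2).
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 7} for i in range(7)}
greedy = greedy_dominating_set(path)
optimum = minimum_dominating_set_size(path)
max_degree = max(len(neighbors) for neighbors in path.values())

# Each star of the optimum is charged at most H(max_degree + 1).
assert len(greedy) <= harmonic(max_degree + 1) * optimum
```

On this path the greedy set and the optimum both have three nodes, well within the $H(\Delta + 1)$ factor.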

Proof of Lemma 4(i): For the proof, we use a technique
for partitioning the annulus between $r_1$ and $r_2$ similar to
the one in [22] and [27]. Let $r_1$ be the radius of an object
$p$ (Figure 17(b)) and $\alpha$ a real number with $0 < \alpha < \pi/3$.
We draw circles around the object $p$ with radii $(2\cos\alpha)^{x_p}$,
$(2\cos\alpha)^{x_p+1}$, $(2\cos\alpha)^{x_p+2}$, $\dots$, $(2\cos\alpha)^{y_p-1}$, $(2\cos\alpha)^{y_p}$, such
that $(2\cos\alpha)^{x_p} \le r_1$ and $(2\cos\alpha)^{x_p+1} > r_1$ and $(2\cos\alpha)^{y_p-1} < r_2$
and $(2\cos\alpha)^{y_p} \ge r_2$. It holds that $x_p = \left\lfloor \frac{\ln r_1}{\ln(2\cos\alpha)} \right\rfloor$ and
$y_p = \left\lceil \frac{\ln r_2}{\ln(2\cos\alpha)} \right\rceil$. In this way, the area around $p$ is
partitioned into $y_p - x_p$ annuluses plus the $r_1$-disk around $p$.

Consider an annulus $A$. Let $p_1$ and $p_2$ be two neighbors
of $p$ in $A$ with $dist(p_1, p_2) > r_1$. Then, it must hold that
$\angle p_1 p p_2 > \alpha$. To see this, we draw two segments from $p$
crossing the inner and outer circles of $A$ at $a$, $b$ and $c$, $d$
such that $p_1$ resides in $pb$ and $\angle bpd = \alpha$, as shown in the
figure. Due to the construction of the circles, it holds that
$\frac{|pb|}{|pc|} = \frac{|pd|}{|pa|} = 2\cos\alpha$. From the cosine law for $pad$, we get
that $|ad| = |pa|$ and, therefore, it holds that $|cb| = |ad| =
|pa| = |pc|$. Therefore, for any object $p_3$ in the area $abcd$ of
$A$, it holds that $|pp_3| > |bp_3|$, which means that all objects in
that area are neighbors of $p_1$, i.e., at distance less than or equal
to $r_1$. For this reason, $p_2$ must reside outside this area, which
means that $\angle p_1 p p_2 > \alpha$. Based on this, we see that there
exist at most $\frac{2\pi}{\alpha} - 1$ independent (for $r_1$) nodes in $A$.
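
The step from the cosine law to $|ad| = |pa|$ can be spelled out. The derivation below is a worked sketch of that step; it uses only the relation $|pd| = 2\cos\alpha \, |pa|$ stated above and the angle $\alpha$ at $p$ in the triangle $pad$.

```latex
\begin{align*}
|ad|^2 &= |pa|^2 + |pd|^2 - 2\,|pa|\,|pd|\cos\alpha \\
       &= |pa|^2 + 4\,|pa|^2\cos^2\alpha - 4\,|pa|^2\cos^2\alpha \\
       &= |pa|^2 .
\end{align*}
```

By symmetry the same computation on the triangle $pcb$ gives $|cb| = |pc|$, and $|pa| = |pc|$ because $a$ and $c$ lie on the same inner circle.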

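The same holds for all annuluses. Therefore, we have at
most $(y_p - x_p)\left(\frac{2\pi}{\alpha} - 1\right)$ independent nodes in the annuluses.
For $0 < \alpha < \pi/3$, this has a minimum when $\alpha$ is close to $\pi/5$ and
that minimum value is $9 \left\lceil \frac{\ln(r_2/r_1)}{\ln(2\cos(\pi/5))} \right\rceil = 9 \left\lceil \log_\beta(r_2/r_1) \right\rceil$,
where $\beta = 2\cos(\pi/5) = \frac{1+\sqrt{5}}{2}$.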
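
As a quick sanity check of the closed form, the bound can be evaluated directly. This is a minimal sketch: the function name `euclidean_annulus_bound` and the constant `GOLDEN_RATIO` are illustrative, and the code assumes the base of the logarithm is $2\cos(\pi/5)$, which equals the golden ratio.

```python
import math

# 2 * cos(pi / 5) equals the golden ratio (1 + sqrt(5)) / 2.
GOLDEN_RATIO = (1 + math.sqrt(5)) / 2


def euclidean_annulus_bound(r1, r2):
    """Evaluate 9 * ceil(log_beta(r2 / r1)) with beta = 2 * cos(pi / 5)."""
    if not 0 < r1 <= r2:
        raise ValueError("expected 0 < r1 <= r2")
    return 9 * math.ceil(math.log(r2 / r1) / math.log(GOLDEN_RATIO))
```

For example, doubling the radius ($r_2 = 2 r_1$) spans two annuluses of the partition, so the sketch evaluates to 18.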

Proof of Lemma 4(ii): Let $r_1$ be the radius of an object $p$.
We draw Manhattan circles around the object $p$ with radii
$r_1$, $2r_1$, $\dots$ until the radius $r_2$ is reached. In this way, the
area around $p$ is partitioned into $\gamma = \left\lceil \frac{r_2 - r_1}{r_1} \right\rceil$ Manhattan
annuluses plus the $r_1$-Manhattan-disk around $p$.

Consider an annulus $A$. The objects shown in Figure 17(c)
cover the whole annulus and their Manhattan pairwise distances
are all greater than or equal to $r_1$. Assume that the annulus
spans between distance $i r_1$ and $(i + 1) r_1$ from $p$, where $i$ is
an integer with $i \ge 1$. Then, $|ab| = \sqrt{2 (i r_1 + r_1/2)^2}$. Also,
for two objects $p_1$, $p_2$ it holds that $|p_1 p_2| = \sqrt{2 (r_1/2)^2}$.
Therefore, at one quadrant of the annulus there are $\frac{|ab|}{|p_1 p_2|}
= 2i + 1$ independent neighbors, which means that there are
$4(2i + 1)$ independent neighbors in $A$. Therefore, there are
in total $\sum_{i=1}^{\gamma} 4(2i + 1)$ independent (for $r_1$) neighbors of $p$.
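
The sum above has the closed form $4\gamma(\gamma + 2)$, since $\sum_{i=1}^{\gamma} (2i + 1) = \gamma^2 + 2\gamma$. The sketch below evaluates both and checks that they agree. The function name is illustrative, and it assumes $\gamma = \lceil (r_2 - r_1)/r_1 \rceil$, i.e., that the number of Manhattan annuluses is rounded up.

```python
import math


def manhattan_annulus_bound(r1, r2):
    """Sum 4 * (2i + 1) over the gamma Manhattan annuluses between r1 and r2."""
    if not 0 < r1 <= r2:
        raise ValueError("expected 0 < r1 <= r2")
    gamma = math.ceil((r2 - r1) / r1)  # assumed rounding: ceiling
    total = sum(4 * (2 * i + 1) for i in range(1, gamma + 1))
    # Closed form of the same sum.
    assert total == 4 * gamma * (gamma + 2)
    return total
```

For $r_2 = 3 r_1$ there are two annuluses, contributing $4 \cdot 3 + 4 \cdot 5 = 32$ independent neighbors in total.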



